TASM: Top-k Approximate Subtree Matching
ثبت نشده
چکیده
We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASMpostorder, a memory-efficient and scalable TASM algorithm. We prove an upper-bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results. DOI: https://doi.org/10.1109/ICDE.2010.5447905 Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-44570 Accepted Version Originally published at: Augsten, N; Böhlen, M; Barbosa, D; Palpanas, T (2010). TASM: Top-k Approximate Subtree Matching. In: IEEE 26th International Conference on Data Engineering (ICDE), 2010, Long Beach, California, USA, 1 March 2010 6 March 2010, 353-364. DOI: https://doi.org/10.1109/ICDE.2010.5447905 University of Zurich Zurich Open Repository and Archive Winterthurerstr. 190
منابع مشابه
TASM: Top-k Approximate Subtree Matching
We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runti...
متن کاملA Hybrid Approximate XML Subtree Matching Method Using Syntactic Features and Word Semantics
With the exponential increase in the amount and size of XML data on the Internet, XML subtree matching has become important for many application areas such as change detection, keyword retrieval and knowledge discoveries over XML documents. In our previous work, we have proposed leaf-clustering based approximate XML subtree matching methods using syntax information of both the clustered leaf no...
متن کاملFast Algorithms for Top-k Approximate String Matching
Top-k approximate querying on string collections is an important data analysis tool for many applications, and it has been exhaustively studied. However, the scale of the problem has increased dramatically because of the prevalence of the Web. In this paper, we aim to explore the efficient top-k similar string matching problem. Several efficient strategies are introduced, such as length aware a...
متن کاملFinding Top-k Approximate Answers to Path Queries
We consider the problem of finding and ranking paths in semistructured data without necessarily knowing its full structure. The query language we adopt comprises conjunctions of regular path queries, allowing path variables to appear in the bodies and the heads of rules, so that paths can be returned to the user. We propose an approximate query matching semantics which adapts standard notions o...
متن کاملState Complexity of Regular Tree Languages for Tree Pattern Matching
We study the state complexity of regular tree languages for tree matching problem. Given a tree t and a set of pattern trees L, we can decide whether or not there exists a subtree occurrence of trees in L from the tree t by considering the new language L′ which accepts all trees containing trees in L as subtrees. We consider the case when we are given a set of pattern trees as a regular tree la...
متن کامل